Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JIT: Enable inlining for late devirtualization #110827

Merged
merged 61 commits into from
Jan 29, 2025

Conversation

hez2010
Copy link
Contributor

@hez2010 hez2010 commented Dec 18, 2024

If we see a new inline candidate after late devirtualization, we can try marking and inlining it.
This unblocks the inlining and stack-allocating arbitrary ref-class enumerators (unless the inliner considers GetEnumerator as non-profitable). We might need to tune the inliner heuristics later to get more profitable inlining opportunities.

Contributes to #7541 and #108913

There're some really nice diffs: https://gist.github.com/MihuBot/29f7c64533ac1f38494fbfab361ab505

Example:

[MethodImpl(MethodImplOptions.NoInlining)]
static float GetDistance()
{
    IVector p1 = Vector.GetVector(4, 2);
    IVector p2 = Vector.GetVector(1, 6);
    IVector dir = p2.Sub(p1);
    return MathF.Sqrt(dir.X * dir.X + dir.Y * dir.Y);
}

struct Vector(float x, float y) : IVector
{
    public float X { get; set; } = x;
    public float Y { get; set; } = y;
    public readonly IVector Sub(IVector other) => GetVector(X - other.X, Y - other.Y);
    public static IVector GetVector(float x, float y) => new Vector(x, y);
}

interface IVector
{
    IVector Sub(IVector other);
    float X { get; set; }
    float Y { get; set; }
}

Before:

; Assembly listing for method Program:GetDistance():int (FullOpts)
G_M12138_IG01:  ;; offset=0x0000
       push     rbx
       sub      rsp, 48
                                                ;; size=5 bbWeight=1 PerfScore 1.25
G_M12138_IG02:  ;; offset=0x0005
       mov      rcx, 0x7FFD37F391F8      ; Vector
       call     CORINFO_HELP_NEWSFAST
       mov      rdx, rax
       mov      rcx, 0x4000000040800000
       mov      qword ptr [rdx+0x08], rcx
       mov      dword ptr [rsp+0x20], 0x3F800000
       mov      dword ptr [rsp+0x24], 0x40C00000
       lea      rcx, [rsp+0x20]
       call     [Vector:Sub(IVector):IVector:this]
       mov      rbx, rax
       mov      rcx, rbx
       mov      r11, 0x7FFD36EE0370      ; code for IVector:get_X():float:this
       call     [r11]IVector:get_X():float:this
       vmovss   dword ptr [rsp+0x2C], xmm0
       mov      rcx, rbx
       mov      r11, 0x7FFD36EE0378      ; code for IVector:get_X():float:this
       call     [r11]IVector:get_X():float:this
       vmulss   xmm0, xmm0, dword ptr [rsp+0x2C]
       vmovss   dword ptr [rsp+0x2C], xmm0
       mov      rcx, rbx
       mov      r11, 0x7FFD36EE0380      ; code for IVector:get_Y():float:this
       call     [r11]IVector:get_Y():float:this
       vmovss   dword ptr [rsp+0x28], xmm0
       mov      rcx, rbx
       mov      r11, 0x7FFD36EE0388      ; code for IVector:get_Y():float:this
       call     [r11]IVector:get_Y():float:this
       vmulss   xmm0, xmm0, dword ptr [rsp+0x28]
       vaddss   xmm0, xmm0, dword ptr [rsp+0x2C]
       vsqrtss  xmm0, xmm0, xmm0
                                                ;; size=166 bbWeight=1 PerfScore 51.50
G_M12138_IG03:  ;; offset=0x00AB
       add      rsp, 48
       pop      rbx
       ret
                                                ;; size=6 bbWeight=1 PerfScore 1.75

GDV:

; Assembly listing for method Program:GetDistance():int (Tier1)
G_M12138_IG01:  ;; offset=0x0000
       push     rsi
       push     rbx
       sub      rsp, 40
                                                ;; size=6 bbWeight=1 PerfScore 2.25
G_M12138_IG02:  ;; offset=0x0006
       mov      rbx, 0x7FFD4531D278      ; Vector
       mov      rcx, rbx
       call     CORINFO_HELP_NEWSFAST
       mov      rsi, rax
       mov      rcx, 0x4000000040800000
       mov      qword ptr [rsi+0x08], rcx
       mov      rcx, rbx
       call     CORINFO_HELP_NEWSFAST
       mov      rcx, 0x40C000003F800000
       mov      qword ptr [rax+0x08], rcx
       add      rax, 8
       vmovss   xmm0, dword ptr [rax]
       vsubss   xmm0, xmm0, dword ptr [rsi+0x08]
       vmovss   xmm1, dword ptr [rax+0x04]
       vsubss   xmm1, xmm1, dword ptr [rsi+0x0C]
       vmovss   dword ptr [rsp+0x24], xmm0
       vmovss   dword ptr [rsp+0x20], xmm1
       mov      rcx, rbx
       call     CORINFO_HELP_NEWSFAST
       vmovss   xmm0, dword ptr [rsp+0x24]
       vmovss   dword ptr [rax+0x08], xmm0
       vmovss   xmm1, dword ptr [rsp+0x20]
       vmovss   dword ptr [rax+0x0C], xmm1
       vmovss   xmm0, dword ptr [rax+0x08]
       vmovaps  xmm1, xmm0
       vmulss   xmm0, xmm1, xmm0
       vmovss   xmm1, dword ptr [rax+0x0C]
       vmovaps  xmm2, xmm1
       vmulss   xmm1, xmm2, xmm1
       vaddss   xmm0, xmm1, xmm0
       vsqrtss  xmm0, xmm0, xmm0
                                                ;; size=156 bbWeight=1 PerfScore 67.50
G_M12138_IG03:  ;; offset=0x00A2
       add      rsp, 40
       pop      rbx
       pop      rsi
       ret
                                                ;; size=7 bbWeight=1 PerfScore 2.25

After:

; Assembly listing for method Program:GetDistance():int (FullOpts)
G_M12138_IG01:  ;; offset=0x0000
                                                ;; size=0 bbWeight=1 PerfScore 0.00
G_M12138_IG02:  ;; offset=0x0000
       vmovss   xmm0, dword ptr [reloc @RWD00]
                                                ;; size=8 bbWeight=1 PerfScore 3.00
G_M12138_IG03:  ;; offset=0x0008
       ret
                                                ;; size=1 bbWeight=1 PerfScore 1.00
RWD00   dd      40A00000h               ;         5

/cc: @AndyAyersMS

@hez2010
Copy link
Contributor Author

hez2010 commented Jan 20, 2025

@MihuBot

@hez2010
Copy link
Contributor Author

hez2010 commented Jan 20, 2025

@MihuBot

@hez2010
Copy link
Contributor Author

hez2010 commented Jan 20, 2025

cc @AndyAyersMS
By enabling this we managed to inline all kinds of ref-class enumerators and stack allocated them: https://gist.github.com/MihuBot/0a4cb4714afc1ff5fe474f99df1ce6dd

@hez2010 hez2010 marked this pull request as ready for review January 21, 2025 10:48
@AndyAyersMS
Copy link
Member

Skimmed the changes and it seems promising. I'll take a deeper look soon.

@hez2010
Copy link
Contributor Author

hez2010 commented Jan 26, 2025

Seems the check doesn't add any meaningful TP (before vs after).
We can't see the real asmdiffs due to missing contexts. Instead, let's check MihuBot for diffs.

@MihuBot

Copy link
Member

@AndyAyersMS AndyAyersMS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is in good shape, just one last small thing.

Copy link
Member

@AndyAyersMS AndyAyersMS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM -- @jakobbotsch any other feedback?

@hez2010
Copy link
Contributor Author

hez2010 commented Jan 28, 2025

I'm testing TP impact of splitting trees in #111896. If it turns out that the TP impact has the similar pattern with the check in this PR, I will merge it into this branch.

Copy link
Member

@jakobbotsch jakobbotsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM as well as is

@hez2010
Copy link
Contributor Author

hez2010 commented Jan 28, 2025

Seems that TP with splitting the tree doesn't have meaningful change: https://dev.azure.com/dnceng-public/public/_build/results?buildId=931974&view=ms.vss-build-web.run-extensions-tab

Merged the commits into this PR.

{
Statement* newStmt = nullptr;
GenTree** callUse = nullptr;
if (m_compiler->gtSplitTree(m_compiler->compCurBB, m_curStmt, call, &newStmt, &callUse))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gtSplitTree needs to learn that it has to split lvHasLdAddrOp locals when it runs early.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

{
Statement* newStmt = nullptr;
GenTree** callUse = nullptr;
if (m_compiler->gtSplitTree(m_compiler->compCurBB, m_curStmt, call, &newStmt, &callUse, true))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you using context != nullptr as a check if devirtualization succeeded? A better check is !call->IsVirtual().

We should avoid splitting the tree if devirtualization fails, and also if the call can't be made into an inline candidate (seems like impMarkInlineCandidate can be called on a non-root call...?)

So maybe reorder things a bit.

I'd also be fine if you revert all this and go back to the non-split version.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also it's not obvious that splitting a tree that we're walking will work out as expected. I think it's ok here, especially since we back up and will reprocess the current tree, but it feels like it would be cleaner if we aborted the walk at this point.

Copy link
Contributor Author

@hez2010 hez2010 Jan 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah context will be nullptr if devirtualization failed.
Switched to use !call.IsVirtual() instead. The context will be used while calling impMarkInlineCandidate so I added an assertion to make sure it's correctly set.

Resolved.

Copy link
Member

@AndyAyersMS AndyAyersMS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good. Will give @jakobbotsch one last chance to look too.

{
Statement* newStmt = nullptr;
GenTree** callUse = nullptr;
if (m_compiler->gtSplitTree(m_compiler->compCurBB, m_curStmt, call, &newStmt, &callUse, true))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strictly speaking this split is unnecessarily conservative: it will introduce temps for the arguments of the call, which can bypass optimizations the inliner can make depending on the specific shape of the argument. Fixing that requires enhancing gtSplitTree a bit (see e.g. my version in this branch), however, even with that done you will run into issues around incorrect treatment of the retbuffer in the inliner. So I wouldn't suggest trying to do anything about it at this point, but it is a potential future improvement.

Copy link
Member

@jakobbotsch jakobbotsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@AndyAyersMS AndyAyersMS merged commit a9d67ec into dotnet:main Jan 29, 2025
112 checks passed
@AndyAyersMS
Copy link
Member

@hez2010 thanks again!

grendello added a commit to grendello/runtime that referenced this pull request Jan 30, 2025
* main:
  Make cdac APIs public but experimental (dotnet#111180)
  JIT: Enable inlining for late devirtualization (dotnet#110827)
  Remove unsafe `bool` casts (dotnet#111024)
  Fix shimmed implementation of TryGetHashAndReset to handle HMAC.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI community-contribution Indicates that the PR has been added by a community member
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants